This document is the summary of the R for Data Analysis workshop.
All correspondence related to this document should be addressed to:
Omid Ghasemi (Macquarie University, Sydney, NSW, 2109, AUSTRALIA)
Email: omidreza.ghasemi@hdr.mq.edu.au
Artwork by Allison Horst: https://github.com/allisonhorst/stats-illustrations
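R can be used as a calculator. For mathematical expressions, be careful about the order in which R evaluates the operations: exponentiation comes before multiplication and division, which come before addition and subtraction. Use parentheses to make the intended order explicit.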
10 + 10
## [1] 20
4 ^ 2
## [1] 16
(250 / 500) * 100
## [1] 50
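As a small illustration of the order of operations, the sketch below evaluates the same numbers with and without parentheses. The variable names are made up for this example.

```r
# Exponentiation is evaluated first, then multiplication, then addition.
no_parens <- 2 + 3 * 4 ^ 2        # 4^2 = 16, 3 * 16 = 48, 2 + 48 = 50
with_parens <- ((2 + 3) * 4) ^ 2  # (5 * 4)^2 = 400

# The unary minus is applied after ^, so this is -(2^2), not (-2)^2.
neg_power <- -2 ^ 2

print(no_parens)    # 50
print(with_parens)  # 400
print(neg_power)    # -4
```

When in doubt, adding parentheses costs nothing and makes the intended order obvious to the reader.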
R is flexible with spacing around numbers and operators (but spaces are not allowed inside variable names or keywords):
10+10
## [1] 20
10 + 10
## [1] 20
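R can sometimes tell that you’re not finished yet. If a command is incomplete, the console shows a + prompt and waits for the rest of the expression: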
10 +
How do you create a variable? Assign a value to a name using <- or =. Note that R is case sensitive, so pay and Pay would be two different variables:
pay <- 250
month = 12
pay * month
## [1] 3000
salary <- pay * month
A few points on naming variables and vectors: use short, informative words, and keep a consistent style (e.g., you can use capital letters but it is not recommended, and separate words only with _ or . ).
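A short sketch of why case and consistent naming matter. The names Pay and monthly_pay are invented for this illustration.

```r
pay <- 250
Pay <- 300            # a different object from pay, because R is case sensitive

# One consistent style (here: lowercase words separated by underscores)
monthly_pay <- pay
yearly_pay <- monthly_pay * 12

print(pay == Pay)     # FALSE: two separate variables with different values
print(yearly_pay)     # 3000
```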
A function is a set of statements combined to perform a specific task. When we use a block of code repeatedly, we can convert it into a function. To write a function, you first need to define it:
my_multiplier <- function(a, b) {
  result <- a * b
  return(result)
}
This code does nothing on its own; it only defines the function. To get a result, you need to call it:
my_multiplier (a=2, b=4)
## [1] 8
# or: my_multiplier (2, 4)
We can set a default value for our arguments:
my_multiplier2 <- function(a, b = 4) {
  result <- a * b
  return(result)
}
my_multiplier2 (a=2)
## [1] 8
# or: my_multiplier2(2)
# to override the default: my_multiplier2(2, 6), which returns 12
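The same ideas (positional arguments, named arguments, and default values) can be sketched with one more function. Note that percent_of() is a hypothetical helper written for this example, not a built-in R function.

```r
# Hypothetical helper: what percentage of `total` is `part`?
# `total` and `digits` have default values, so they can be omitted in a call.
percent_of <- function(part, total = 100, digits = 1) {
  # The value of the last expression is returned, so return() is optional.
  round(part / total * 100, digits = digits)
}

by_position <- percent_of(250, 500)               # arguments matched by position
by_name <- percent_of(total = 500, part = 250)    # arguments matched by name, any order
with_defaults <- percent_of(12.345)               # total = 100 and digits = 1 are used

print(by_position)    # 50
print(by_name)        # 50
print(with_defaults)  # 12.3
```

Naming the arguments makes a call easier to read and protects you from passing values in the wrong order.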
Fortunately, you do not need to write everything from scratch. R has lots of built-in functions that you can use:
round(54.6787)
## [1] 55
round(54.5787, digits = 2)
## [1] 54.58
Use ? before the function name to get some help. For example, ?round. You will see many functions in the rest of the workshop.
The function class() shows the type of a variable.
TRUE and FALSE can be abbreviated as T and F. They have to be in capital letters; ‘true’ is not a logical value:
class(TRUE)
## [1] "logical"
class(F)
## [1] "logical"
class(2)
## [1] "numeric"
class(13.46)
## [1] "numeric"
class("ha ha ha ha")
## [1] "character"
class("56.6")
## [1] "character"
class("TRUE")
## [1] "character"
Can we change the type of data in a variable? Yes, by using the as.*() family of functions, such as as.numeric() and as.character():
as.numeric(TRUE)
## [1] 1
as.character(4)
## [1] "4"
as.numeric("4.5")
## [1] 4.5
as.numeric("Hello")
## Warning: NAs introduced by coercion
## [1] NA
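Coercion also works on whole vectors, which is handy when numbers were read in as text. The sketch below uses made-up values; anything that cannot be converted becomes NA.

```r
# Made-up example data: numbers stored as text, with one invalid entry
raw_scores <- c("12", "7.5", "n/a", "30")

# as.numeric() converts element by element; "n/a" becomes NA.
# suppressWarnings() only hides the coercion warning for this demo.
scores <- suppressWarnings(as.numeric(raw_scores))

print(class(raw_scores))          # "character"
print(class(scores))              # "numeric"
print(is.na(scores))              # FALSE FALSE TRUE FALSE
print(mean(scores, na.rm = TRUE)) # 16.5, computed without the NA
```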
A vector stores more than one number or character string under a single name. Use the combine function c() to create one:
sale <- c(1, 2, 3,4, 5, 6, 7, 8, 9, 10) # also sale <- c(1:10)
sale <- c(1:10)
sale * sale
## [1] 1 4 9 16 25 36 49 64 81 100
Subsetting a vector:
days <- c("Saturday", "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
days[2]
## [1] "Sunday"
days[-2]
## [1] "Saturday" "Monday" "Tuesday" "Wednesday" "Thursday" "Friday"
days[c(2, 3, 4)]
## [1] "Sunday" "Monday" "Tuesday"
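Besides positions, a vector can also be subset with a logical condition. A minimal sketch using the same days vector (the names weekdays_only and long_names are chosen for this example):

```r
days <- c("Saturday", "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# Keep every element for which the condition is TRUE
weekdays_only <- days[days != "Saturday" & days != "Sunday"]

# nchar() counts the characters in each element
long_names <- days[nchar(days) > 7]

print(weekdays_only)  # "Monday" "Tuesday" "Wednesday" "Thursday" "Friday"
print(long_names)     # "Saturday" "Wednesday" "Thursday"
```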
Create a vector named my_vector with numbers from 0 to 1000 in it and calculate the mean, median, sd, min, max, and sum of that vector:
my_vector <- 0:1000
mean(my_vector)
## [1] 500
median(my_vector)
## [1] 500
min(my_vector)
## [1] 0
range(my_vector)
## [1] 0 1000
class(my_vector)
## [1] "integer"
sum(my_vector)
## [1] 500500
sd(my_vector)
## [1] 289.1081
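To see what mean() and sd() compute, the sketch below reproduces them by hand. Note that sd() is the sample standard deviation, so it divides by n - 1 rather than n.

```r
my_vector <- 0:1000
n <- length(my_vector)                 # 1001 values

# Mean: sum of the values divided by how many there are
manual_mean <- sum(my_vector) / n

# Sample standard deviation: square root of the sum of squared
# deviations from the mean, divided by (n - 1)
manual_sd <- sqrt(sum((my_vector - manual_mean)^2) / (n - 1))

print(manual_mean)                            # 500
print(all.equal(manual_sd, sd(my_vector)))    # TRUE
```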
A list allows you to gather a variety of objects under one name (that is, the name of the list) in an ordered way. These objects can be matrices, vectors, data frames, or even other lists.
my_list = list(sale, 1, 3, 4:7, "HELLO", "hello", FALSE)
my_list
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## [[2]]
## [1] 1
##
## [[3]]
## [1] 3
##
## [[4]]
## [1] 4 5 6 7
##
## [[5]]
## [1] "HELLO"
##
## [[6]]
## [1] "hello"
##
## [[7]]
## [1] FALSE
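List elements can be named and then pulled out in a few ways. A minimal sketch, with element names (sale, greeting, flag) invented for this example:

```r
named_list <- list(sale = 1:10, greeting = "HELLO", flag = FALSE)

first_element <- named_list[[1]]     # double brackets return the element itself
greeting <- named_list$greeting      # $ returns the element by name
sub_list <- named_list["flag"]       # single brackets return a smaller list

print(class(first_element))  # "integer"
print(greeting)              # "HELLO"
print(class(sub_list))       # "list"
```

The difference between [ ] and [[ ]] matters: most functions that expect a vector need the element itself, not a one-element list.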
Factors store a vector along with the distinct values of its elements as labels (levels). The labels are always stored as character, irrespective of whether the original values are numeric or character. For example, a variable gender with “male” and “female” entries:
gender <- c("male", "male", "male", " female", "female", "female")
gender <- factor(gender)
R now treats gender as a nominal (categorical) variable and codes the levels internally as integers in alphabetical order (we would expect 1=female, 2=male).
summary(gender)
## female female male
## 1 2 3
gender
## [1] male male male female female female
## Levels: female female male
So, be careful with spaces! The entry " female" (with a leading space) is treated as a separate level from "female".
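One possible way to repair stray spaces before creating the factor is the base R function trimws(), which removes leading and trailing whitespace. A minimal sketch (the names gender_raw and gender_clean are chosen for this example):

```r
gender_raw <- c("male", "male", "male", " female", "female", "female")

# Strip leading/trailing whitespace, then convert to a factor
gender_clean <- factor(trimws(gender_raw))

print(levels(gender_clean))   # "female" "male"
print(table(gender_clean))    # 3 female, 3 male
```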
Now create a gender factor with 30 male and 40 female entries (hint: use the rep() function):
gender <- c(rep("male",30), rep("female", 40))
gender <- factor(gender)
gender
## [1] male male male male male male male male male male
## [11] male male male male male male male male male male
## [21] male male male male male male male male male male
## [31] female female female female female female female female female female
## [41] female female female female female female female female female female
## [51] female female female female female female female female female female
## [61] female female female female female female female female female female
## Levels: female male
There are two types of categorical variables: nominal and ordinal. How do we create ordered factors (when the variable is ordinal, that is, its values can be ordered)? We add the argument ordered = TRUE to the factor() function and give the level names in order, either with the levels argument, e.g., levels = c("level1", "level2"), or afterwards with the levels() function. For example, we have a vector that shows participants’ education level:
edu<-c(3,2,3,4,1,2,2,3,4)
education<-factor(edu, ordered = TRUE)
levels(education) <- c("Primary school","high school","College","Uni graduated")
education
## [1] College high school College Uni graduated Primary school
## [6] high school high school College Uni graduated
## Levels: Primary school < high school < College < Uni graduated
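The same ordered factor can also be built in a single factor() call, using levels for the codes and labels for the names. The sketch below then shows what the ordering buys you: comparisons between levels. The name education2 is used here only to avoid overwriting the object above.

```r
edu <- c(3, 2, 3, 4, 1, 2, 2, 3, 4)

education2 <- factor(edu,
                     levels = 1:4,
                     labels = c("Primary school", "high school", "College", "Uni graduated"),
                     ordered = TRUE)

# Comparisons are meaningful only for ordered factors
print(education2[1] > education2[2])     # TRUE: College > high school
print(sum(education2 >= "College"))      # 5 participants with College or higher
print(as.character(max(education2)))     # "Uni graduated"
```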
Consider a factor with patient and control values. Here, the first level is control and the second level is patient. Change the order of the levels so that patient is the first level:
health_status <- factor(c(rep('patient',5),rep('control',5)))
health_status
## [1] patient patient patient patient patient control control control control
## [10] control
## Levels: control patient
health_status_reordered <- factor(health_status, levels = c('patient','control'))
health_status_reordered
## [1] patient patient patient patient patient control control control control
## [10] control
## Levels: patient control
Finally, can you relabel both levels so that they start with an uppercase letter? (Hint: check ?factor)
health_status_relabeled <- factor(health_status, levels = c('patient','control'), labels = c('Patient','Control'))
health_status_relabeled
## [1] Patient Patient Patient Patient Patient Control Control Control Control
## [10] Control
## Levels: Patient Control
All elements in a matrix must have the same mode (numeric, character, etc.), and all columns must have the same length. A matrix can be created by passing a vector to the matrix() function, which fills the matrix column by column by default:
my_matrix = matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3, ncol = 3)
my_matrix
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
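A short sketch of two things that often trip people up with matrices: the fill direction and [row, column] indexing. The object names are chosen for this example.

```r
# Default: values fill the matrix column by column
m_bycol <- matrix(1:9, nrow = 3, ncol = 3)

# byrow = TRUE fills row by row instead
m_byrow <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)

print(m_bycol[2, 3])   # 8: row 2, column 3 of the column-filled matrix
print(m_byrow[2, 3])   # 6: row 2, column 3 of the row-filled matrix
print(dim(m_bycol))    # 3 3

# Transposing one fill order gives the other
print(identical(t(m_bycol), m_byrow))   # TRUE
```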
Data frames can hold numeric, character, or logical values. Within a column all elements have the same data type, but different columns can be of different data types. Let’s create a data frame:
id <- 1:200
group <- c(rep("Psychotherapy", 100), rep("Medication", 100))
response <- c(rnorm(100, mean = 30, sd = 5),
rnorm(100, mean = 25, sd = 5))
my_dataframe <-data.frame(Patient = id,
Treatment = group,
Response = response)
We could also have done the following:
my_dataframe <-data.frame(Patient = c(1:200),
Treatment = c(rep("Psychotherapy", 100), rep("Medication", 100)),
Response = c(rnorm(100, mean = 30, sd = 5),
rnorm(100, mean = 25, sd = 5)))
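Because rnorm() draws random numbers, the Response values change every time the code is run. A minimal sketch of making the data reproducible with set.seed() and then inspecting the result; the seed value 123 and the name demo_df are arbitrary choices for this example.

```r
set.seed(123)   # arbitrary seed: the same seed gives the same random numbers

demo_df <- data.frame(
  Patient   = 1:200,
  Treatment = rep(c("Psychotherapy", "Medication"), each = 100),
  Response  = c(rnorm(100, mean = 30, sd = 5),
                rnorm(100, mean = 25, sd = 5))
)

str(demo_df)            # column names, types, and the first few values
print(nrow(demo_df))    # 200 rows
print(ncol(demo_df))    # 3 columns
print(names(demo_df))   # "Patient" "Treatment" "Response"
```

rep(..., each = 100) repeats each treatment name 100 times in turn, which is equivalent to the two separate rep() calls used above.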
In large data sets, the function head() shows the first observations of a data frame. Similarly, the function tail() prints out the last observations in your data set.
head(my_dataframe)

| | Patient | Treatment | Response |
|---|---|---|---|
| 1 | 1 | Psychotherapy | 30.42391 |
| 2 | 2 | Psychotherapy | 31.39435 |
| 3 | 3 | Psychotherapy | 31.98380 |
| 4 | 4 | Psychotherapy | 30.26156 |
| 5 | 5 | Psychotherapy | 32.89160 |
| 6 | 6 | Psychotherapy | 30.58312 |

tail(my_dataframe)

| | Patient | Treatment | Response |
|---|---|---|---|
| 195 | 195 | Medication | 18.54472 |
| 196 | 196 | Medication | 25.31502 |
| 197 | 197 | Medication | 22.52473 |
| 198 | 198 | Medication | 22.01121 |
| 199 | 199 | Medication | 29.19932 |
| 200 | 200 | Medication | 30.09011 |

Similar to vectors and matrices, brackets [] are used to select data from rows and columns in data frames. The first index refers to rows and the second to columns:
my_dataframe[35, 3]
## [1] 19.96716
my_dataframe[1:10, ]
| Patient | Treatment | Response |
|---|---|---|
| 1 | Psychotherapy | 30.42391 |
| 2 | Psychotherapy | 31.39435 |
| 3 | Psychotherapy | 31.98380 |
| 4 | Psychotherapy | 30.26156 |
| 5 | Psychotherapy | 32.89160 |
| 6 | Psychotherapy | 30.58312 |
| 7 | Psychotherapy | 29.31386 |
| 8 | Psychotherapy | 31.86684 |
| 9 | Psychotherapy | 31.60194 |
| 10 | Psychotherapy | 28.17029 |
How to get only the Response column for all participants?
my_dataframe[ , 3]
## [1] 30.42391 31.39435 31.98380 30.26156 32.89160 30.58313 29.31386 31.86684
## [9] 31.60194 28.17029 40.70102 31.00756 25.84319 29.56527 32.88131 17.93231
## [17] 28.13834 28.38068 29.21312 30.41091 24.42506 28.72800 32.28634 27.02297
## [25] 28.65416 36.01442 35.72138 27.42288 31.63057 34.75769 28.23678 23.25628
## [33] 29.95145 28.87311 19.96716 25.39401 46.97635 26.88618 28.89566 35.20675
## [41] 26.74658 26.21272 30.33895 32.62540 21.64545 23.72600 32.02974 33.69672
## [49] 37.39239 29.75127 34.12932 25.31246 30.94949 17.77188 24.35454 33.22516
## [57] 19.23332 30.87546 33.20793 30.50553 20.04208 41.22914 29.10625 33.77548
## [65] 29.33675 31.01131 32.29470 30.55776 31.00207 29.95526 30.04618 31.56003
## [73] 32.44361 29.62436 20.27625 31.37769 21.42085 37.07180 30.83059 30.90844
## [81] 27.85175 29.43705 26.01446 33.49888 24.77147 32.02480 30.51447 28.56813
## [89] 30.74553 31.14742 36.53496 24.67334 30.76875 38.65474 23.12777 32.82991
## [97] 22.45032 35.49847 29.66192 21.85724 23.21670 18.26624 27.45448 34.09305
## [105] 30.46837 26.77456 25.37303 39.98304 26.29323 30.46605 20.37679 26.87346
## [113] 20.11632 23.56563 33.03331 30.29873 26.43332 25.79077 24.43063 28.59376
## [121] 27.90288 25.23165 19.69473 28.89037 26.06537 19.33005 24.58180 26.43273
## [129] 31.75669 18.81263 13.07141 25.53940 34.57619 24.38474 18.68158 25.61714
## [137] 29.46947 25.14800 31.91151 25.29593 29.32774 28.34083 33.82708 31.55684
## [145] 28.46168 22.96394 24.18663 22.86743 27.04206 23.95114 21.81046 21.36594
## [153] 22.17787 23.33292 33.63888 32.96656 26.29168 26.07104 28.96076 25.24093
## [161] 18.95290 29.48124 19.74454 26.28502 27.13508 22.27233 27.23583 33.87589
## [169] 23.84769 33.74727 20.97366 15.34242 22.37948 19.52661 27.65811 29.67194
## [177] 30.54342 23.39429 26.72360 28.42296 25.03079 23.70069 27.67194 26.06394
## [185] 32.07253 27.28652 25.06557 22.95932 24.46923 36.44290 25.86616 18.43971
## [193] 34.58480 33.33262 18.54472 25.31502 22.52473 22.01121 29.19932 30.09011
An easier way to select a particular column is to use its name, which is more helpful than its position number in large data sets:
my_dataframe[ , "Response"]
# OR:
my_dataframe$Response
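Names and conditions can be combined to select rows as well. The sketch below rebuilds a small stand-in data frame so that it runs on its own; demo_df, the seed, and the cut-off of 35 are assumptions made for this illustration.

```r
set.seed(123)   # arbitrary seed so the example is reproducible
demo_df <- data.frame(
  Patient   = 1:200,
  Treatment = rep(c("Psychotherapy", "Medication"), each = 100),
  Response  = c(rnorm(100, mean = 30, sd = 5),
                rnorm(100, mean = 25, sd = 5))
)

# Rows where the condition is TRUE, all columns
psychotherapy_rows <- demo_df[demo_df$Treatment == "Psychotherapy", ]

# Rows with a response above 35, only two named columns
high_responders <- demo_df[demo_df$Response > 35, c("Patient", "Response")]

# Mean response within each treatment group
group_means <- tapply(demo_df$Response, demo_df$Treatment, mean)

print(nrow(psychotherapy_rows))   # 100
print(head(high_responders))
print(group_means)
```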
So far, we have created data frames using the data.frame() function from base R. However, a better way to create data frames is to use the tibble() function from the tidyverse.